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MAXIMUM LIKELIHOOD ESTIMATES UNDER K-ALLELE 
MODELS WITH SELECTION CAN BE NUMERICALLY UNSTABLE 

By Erkan Ozge Buzbas^ and Paul Joyce^ 

University of Idaho 

The stationary distribution of allele frequencies under a variety 
of Wright-Fisher fc-allele models with selection and parent indepen- 
dent mutation is well studied. However, the statistical properties of 
maximum likelihood estimates of parameters under these models are 
not well understood. Under each of these models there is a point in 
data space which carries the strongest possible signal for selection, 
yet, at this point, the likelihood is unbounded. This result remains 
valid even if all of the mutation parameters are assumed to be known. 
Therefore, standard simulation approaches used to approximate the 
sampling distribution of the maximum likelihood estimate produce 
numerically unstable results in the presence of substantial selection. 
We describe the Bayesian alternative where the posterior distribution 
tends to produce more accurate and reliable interval estimates for the 
selection intensity at a locus. 

1. Introduction. We begin by introducing a certain amount of termi- 
nology from population genetics. Within each living cell are a certain fixed 
number of chromosomes, threadlike objects that govern the inheritable char- 
acteristics of an organism. At certain positions, or loci, on the chromosomes, 
are genes, the fundamental units of heredity. At each locus there are several 
alternative types of genes or alleles. A diploid organism has chromosomes 
that occur in homologous pairs. The unordered pair of genes situated at the 
same locus of the homologous pair is called a genotype. Thus, if there are 
k alleles, Ai,A2, .. .,Ak at a given locus, then there are k{k + l)/2 possible 
genotypes, (AiAj), I <i <j <k. The fitness of a genotype (AiAj) is deter- 
mined by the reproductive success of individuals carrying that genotype. 



Received July 2008; revised January 2009. 

^Supported both by NSF Grant DEB-0515738 and NIH Grant P20 RR016454 from the 
INBRE Program of the National Center for Research Resources. 
2 Supported in part by NSF Grant DEB-0515738. 

Key words and phrases. Selective overdominance, heterozygote advantage, multiple al- 
lele models, maximum likelihood, posterior analysis, instability. 

This is an electronic reprint of the original article published by the 

Institute of Mathematical Statistics in The Annals of Applied Statistics, 

2009, Vol. 3, No. 3, 1147-1162. This reprint differs from the original in pagination 

and typographic detail. 



1 



2 E. O. BUZBAS AND P. JOYCE 

Understanding the evolutionary forces that shape the patterns of observed 
genetic diversity is central to population genetics. Over evolutionary time, 
genotypes with higher fitness tend to drive out those with lower fitness, 
thus reducing the genetic diversity of a population with respect to the par- 
ticular locus under selection. However, there are selective mechanisms that 
actually promote genetic diversity. A simple such mechanism called het- 
erozygote advantage assumes that carrying two variant copies of a gene at 
a locus (heterozygote) leads to higher fitness in comparison to carrying the 
same copy of the gene (homozygote) . For example, individuals suffering from 
sickle cell anemia carry two copies of a disease gene. On the other hand, indi- 
viduals that carry two copies of the healthy gene are susceptible to malaria. 
In regions with high incidence of malaria, individuals carrying one copy of 
the healthy gene and one copy of the diseased gene have higher fitness be- 
cause they suffer only mild symptoms of anemia while also being resistant 
to malaria [Allison (1956), Cavalli-Sforza and Bodmer (1971), Harding and 
Griffiths (1997)]. As an infectious disease, malaria is a major health threat 
to human populations, affecting approximately 515 million people globally 
[Snow et al. (2005)]. 

For a population with allele frequencies xi, X2, . . . , a^fe, where X]i=i = 1, 
we define the homozygosity to be /i = X]i=i which is the probability that a 
sample of size two will produce two genes with the same allele. A population 
under the influence of heterozygote advantage would likely have low homozy- 
gosity. The lowest possible value occurs when all of the allele frequencies are 
equal (xj = l/fc for all i). Thus, low homozygosity might suggest that a high 
allelic diversity in a population is explained by heterozygote advantage. If 
all genotypes have equal fitness, we call the locus neutral. In this case, all 
of the genetic diversity is produced by mutation from one allele to another. 

While the heterozygote advantage model assumes a diploid population, 
it is mathematically equivalent to a frequency dependent selection model 
which can apply to haploid organisms [Neuhauser (1999)]. One form of fre- 
quency dependence implies that high frequency alleles are at a selective 
disadvantage relative to alleles at low frequency. An example of this regime 
comes from allele frequencies from a bacteria that causes Lyme disease {Bor- 
relia burgdorferi) [Donnelly, Nordborg and Joyce (2001)]. The data [previ- 
ously published in Qiu et al. (1997)] consist of four alleles with frequencies 
x' = (0.103, 0.375, 0.270, 0.252). The observed homozygosity ish = 0.288, rel- 
atively close to the minimum 0.25 under k = 4. Under the neutral model 
assumption, mutations that occurred in the distant past would correspond 
to high frequency alleles and more recent mutations would give rise to low 
frequency alleles. So under neutrality one might expect a few alleles in high 
frequency and most alleles in low frequency, corresponding to relatively high 
homozygosity. At first glance, the homozygosity in the above data appears to 
be too low to be explained by neutrality. Watterson derived the distribution 
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of the homozygosity statistic under the assumption of neutrality which lead 
to one of the most common tests to distinguish departures from neutrality 
[Watterson (1977)]. However, with modern computational methods, we are 
now in a position to go beyond simply determining whether neutrality is 
feasible. We can now develop likelihood based inference methods for pre- 
cise alternative models that incorporate selection and estimate the strength 
of selection. The alternative models presented here are called the Wright- 
Fisher /c-allele models with selection [Wright (1949)]. These classical models 
have a rich mathematical theory and long history in population genetics [see 
Ewens (2004) for background], mainly because they provided theoretical in- 
sight into the dynamics of genes evolving under selection. In the data rich 
world that population genetics finds itself today, there is a renewed interest 
in these models as useful tools to draw inferences on selection at genetic loci 
of interest. 

A brief description of the Wright-Fisher process is as follows. Consider 
tracking a population of 2N genes over many generations. A gene can be one 
of k possible alleles. Generations are non overlapping and the probability of 
sampling genotype {AiAj) is proportional to its fitness, wij = 1 — sij. To ob- 
tain the next generation, 2N pairs of genes are sampled with replacement. 
A randomly chosen allele within each sampled pair mutates to Ai with prob- 
ability tij, independent of the parent's type. A standard diffusion argument 
[Ewens (2004)] based on the Markov chain of allele frequencies generates the 
stationary distribution 



where x is a (/c x 1) column vector of allele frequencies in the population, 
subject to the condition X]f=i = 1; whose transpose is the row vector 
x' = (xi, . . . , Xk)- We have 0' = {6i, . . . , 6k), where 6i = ANui, is the scaled 
mutation parameter for type i and X) = {dij) with cjjj = 2Nsij, the {k x k) 
symmetric matrix of scaled selection parameters. The probability density 
function for allele frequencies under neutrality, /Neut(x|0), is given by the 
familiar Dirichlet distribution 



We will use the notation -E'Ncut(') to represent expectation with respect to the 
neutral density given by equation (2), and -Esei(' |^) to denote expectation 
with respect to the density that incorporates selection given by equation (1). 

The symmetric selective overdominance model is a special case of equa- 
tion (1), obtained by setting Wij = 1, that is {sij = 0) for heterozygotes 
(i 7^ j) and wu = 1 — s, (sa = s > 0) for homozygotes regardless of the 
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specific type and assuming symmetric mutation Oi = for all i. Under 
these assumptions, the matrix of selection parameters reduces to a diagonal 
matrix with equal elements, S = (71^, where 1^ is the k dimensional iden- 
tity matrix, a = 2Ns, and the mutation parameter is 9 = ANu. Therefore, 
X'5]X = crX^iLi ^j^- Substituting appropriately into equation (1) gives 

g-'^Eti^? T{e) 

Despite the wide ranging applications of fc-allele models, the statistical prop- 
erties of estimators are not well understood. This article aims to clarify in- 
ference problems associated with estimators of the selection intensity and 
presents correct frequentist and Bayesian methods for inference under k- 
allele models. Theoretical results describing the large variability in the sam- 
pling distribution of maximum likelihood estimates (MLEs), arising in the 
analysis of /c-allele models with selection in general and of symmetric over- 
dominance in particular, are given. The likelihood of allele frequencies under 
the stationary distribution of the diffusion limit of a Wright-Fisher popu- 
lation is examined and the existence of a numerical instability associated 
with MLEs for certain population compositions is established. Numerical 
instability of MLEs occurs under what would normally be considered ideal 
conditions. These conditions include the assumption that all the mutation 
parameters are either known or can be estimated without error. Also, the al- 
lele frequencies of the entire population are assumed to be observed, rather 
than the usual assumption that the data are viewed as a random sample 
from the population. Even under these idealized conditions, it is shown that 
when the data carry a strong signal for selection, parametric bootstrap is 
inaccurate and unreliable for assessing the strength of selection. 

In Section 2 the theoretical basis for the instability behavior is formalized 
by a theorem: under fc-allele models with selection, there is a singularity 
point on the allele frequency space where the likelihood is unbounded. A 
corollary associated with the theorem is also presented: under the symmet- 
ric overdominance model, the above mentioned singularity arises when all 
the alleles have equal frequencies. This is identified as a perfectly heterozy- 
gous population. This result is surprising since it suggests that appreciable 
information in the data about selection yield poor estimates and MLEs with 
a parametric bootstrap approach cannot be used effectively to estimate the 
strength of selection. Therefore, our findings highlight the limitations of us- 
ing the sampling distribution of the MLEs for inference under A:-allele models 
with selection. 

Section 3 exploits a monotonicity argument to show a correct frequentist 
approach to the problem as well as a Bayesian method as an alternative to 
parametric bootstrap. Data from a Killer-cell immunoglobulin-like receptor 
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Fig. 1. Left: The distribution of the maximum likelihood estimates as a function of pop- 
ulation homozygosity based on equation (3) considering data space that can arise under 
k = 20 on a grid, a asymptotes as h approaches to 1/fc = 0.05. Right: Heavy tailed sampling 
distributions of a, each based on 1000 simulated data sets with 9 = 5. 

(KIR) locus and the Lyme disease bacteria locus discussed above are used 
to demonstrate the methods in Section 4. A comparison of the heterozygote 
advantage with another selective mechanism of importance, the homozygote 
advantage, is given in Section 5, to emphasize that the effect of instability 
in MLEs may be different under different selective scenarios. 

2. Instability of the MLE. A useful summary of the population com- 
position under the overdominance model is the homozygosity statistic H = 
^iXf, which is sufficient for a given 9 and the MLE a{h), is a decreasing 
function oi h = J2i ^1 ■ 

The signal for selection is strong when the population homozygosity is 
small (Figure 1, left). However, the high rate of divergence and unbounded- 
ness of o"(/i) as h approaches to its minimum value 1/A; is unexpected. Small 
perturbations of allele frequencies have a drastic effect on point estimates, 
particularly for small h. For instance, for a highly polymorphic locus with 
A: = 20 possible alleles, where 1/k = 0.05, a 38% decrease in homozygosity 
{h = 0.13 to h = 0.08) corresponds to an approximate 300% increase in a{h); 
[6-(0.13) ^ 350, whereas a(0.08) > 900]. 

We use the numerical methods of [Genz and Joyce (2003), Joyce, Genz 
and Buzbas (2008)] to generate data from the density (3) and to evaluate 
the likelihood based on it. These simulated data are then used to estimate 
the error of a. Assume we condition on 9, so that the sole focus of inference 
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is on a. When a simulated sample with /i ~ (1/A;) is drawn, the resulting a is 
large. Thus, such samples contribute to a heavy right tail for the distribution 
of a. Sampling distributions generated under strong selection clearly reveal 
the substantial effect of the heavy tail on the estimation of a (Figure 1, 
right). Estimates from data sets generated under a variety of {a,9,k) values 
show that the heavy tail in the distribution of cj is a persistent feature. In 
fact, all fc-allele models have a singularity in data space. Theorem 1 below 
gives a precise statement of this general phenomenon. 

Theorem 1. Consider the probability density function /gci(x|0,S) de- 
fined by equation (1) that describes the distribution of allele frequencies at 
stationarity under the Wright-Fisher model with selection and parent inde- 
pendent mutation. There exists a vector of allele frequencies x* = (x^, . . . , x^.)' , 
where /sci(x*[0,S) is unbounded as a function ofS regardless of 9. 

The proof of Theorem 1 is given in Appendix A, where we show that the 
point X* where /sei(x*|0, is unbounded as a function of 5] is found by 
minimizing the quantity J2i j '^ij^i^j subject to the constraint J2i^i = 1- If 
the mean fitness of the population is defined hy w = J2i,j'^ij^i^jj have 
Y^i^jC^ijXiXj = 2A^(1 — w). Therefore, x* is a point in the data space where 
w is optimal. 

The symmetric overdominance model, which has a considerably simpler 
matrix of selection parameters, allows for further results. In this case, both 
the limiting value for a and the point x* at which the likelihood is unbounded 
can be established. 

Corollary 1. Consider the probability density function /sci(x|^, o"), 
defined by equation (3) that describes the distribution of allele frequencies at 
stationarity for the Wright-Fisher symmetric selective overdominance model 
with parent independent mutation. Let x* = . . . , 

a. If 9 is assumed to be known, then, for all allele frequencies x^^x*, the 
maximum likelihood estimate for a is finite. Denote the MLE as a func- 
tion of the homozygosity h = X]^=i by a{h). Then, 

(4) lim (rih) = oo. 

h^(i/k)+ 

b. For all 9>0, if {Xi,X2, ■ ■ ■ jXi^.) has joint probability density given by 
equation (3), then (Xi, X2, . . . , X^) converges in probability to l/fe, . . . , 
l/k) as cr —> 00. 

(See Appendix B for proof). 
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Part (b) of Corollary 1 provides some context for the ill behavior of the 
MLE for a. If a is large, it is highly likely that the population frequencies 
are nearly equal. The asymptotic behavior of populations in the limit as a 
goes to infinity has been studied in other contexts by both Gillespie (1999) 
and Joyce, Krone and Kurtz (2003). 

3. Methods to assess the error in estimates. 

3.1. Estimates using the monotonicity of homozygosity. By exploiting 
the monotonicity of the distribution of the homozygosity statistic, H, we 
develop a frequentist approach to obtain interval estimates of selection in- 
tensity under the selective overdominance model [equation (3)] that does not 
rely on approximating the sampling distribution for the MLE a. Through- 
out we assume that 9 is known and we drop the parameter 9 for notational 
convenience when denoting the cumulative distribution function (cdf) for 
the homozygosity H = X]i=i such that we have 

Fu{h\a) = Psci{H<h\9,a). 

While the cdf is always a monotone function of h for fixed a, this particular 
cdf is also monotone with respect to the parameter a for fixed h. More 
precisely, we can state that as a increases, the probability of H being smaller 
than a particular value h increases. An exact confidence interval for a is 
produced using this monotonicity property in a. 

For a given confidence level (1 — a) and an observed homozygosity H = h, 
we choose (Jl and ajj so that 

(5) Fu{h\&L) =ai, F^{h\au) = 1-02, 

where a = ai+a2- We interpret aL and ajj as the smallest and largest values 
of cr that supports the data. Since F}i{h\a) is a monotone increasing function 
of CT, then (Tl < a < au if and only if FH{h\cTi) < FH{h\a) < FH{h\au), 
which implies ai < FH{h\a) < 1 — a2- Therefore, 

Pscii^L <a<au) = Psci(ai < FH{H\a) < 1 - aa). 

The standard theory shows that F}i{H\a) is a uniformly distributed random 
variable on the interval [0, 1] to give the result 

Pscii^L < cr < au) = I - ai - 02 = I - a. 

Therefore, [(7i,(Tc/] is an exact (1 — a) level confidence interval. 

However, when 9 is unknown, the monotonicity of -ff in cr holds no more 
and only confidence regions are possible to obtain. Therefore, in applica- 
tions where the variability in 9 is expected to be considerable and the joint 
estimation of {9, a) is required, the method is not very useful. Next, we turn 
to the Bayesian approach as an alternative that allows for marginalization 
in a. 
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Fig. 2. A parametric bootstrap sample of size 10000 for a using the Lyme disease data. 
All values of a > 300 are plotted at 300. 

3.2. Estimates based on the Bayesian methods. Assuming independent 
uniform priors on {6, a), the joint posterior distribution of {9, a) is propor- 
tional to the hkehhood, 



which can be sampled using a standard Markov Chain Monte Carlo ap- 
proach. We sampled the joint posterior distribution of the parameters via 
an independent Metropolis-Hastings update [Metropolis et al. (1953), Hast- 
ings (1970)]. The posterior mode is found by numerically maximizing the 
joint distribution and 95% credible intervals are used as a measure of vari- 
ability for a. 

Since both the posterior and the bootstrap are based on the likelihood, 
intuition suggests a similar problem of instability might arise in the Bayesian 
analysis. Examining the posterior sample of a (see examples in Section 4), 
we find that the Bayesian approach does not have the instability observed in 
the bootstrap. The reason is that, in contrast to the parametric bootstrap 
which generates data, posterior analysis works on fixed data. While each 
simulated data set in the bootstrap has a certain probability of falling into 
the instability region, this problem is avoided in posterior simulation, by 
sampling the parameter space rather than the data space. 
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4. Examples. In this section we present two data applications to com- 
pare the three methods discussed above for making inference on selection 
intensity: MLE-bootstrap, monotonicity and the Bayesian approach. 

4.1. Lyme disease data. We revisit (see Section 1) the Lyme disease bac- 
teria data. Recall the data consist of four alleles with frequencies x' = 
(0.103,0.375,0.270,0.252). The observed homozygosity is = 0.288 rela- 
tively close to the minimum 0.25 under k = A. The MLEs for the mutation 
and selection parameters are (0 = 4.8,(7 = 35.1) respectively. A parametric 
bootstrap with (6,a,k) = (4.8,35.1,4) admits poor precision for the esti- 
mated selection intensity as discussed in Section 2. Based on the simulated 
sampling distribution for a (Figure 2), we get an estimated standard er- 
ror of 176.4. The 2.5th percentile of the simulated sampling distribution 
of a corresponds to 17.2 and the 97.5th percentile is 681.3. Therefore, an 
approximate 95% interval estimate based on the parametric bootstrap as- 
sociated with a is (17.2,681.3). Recall that the observed homozygosity is 
0.228, but P{H > 0.288|cr = 681.2) < 0.001, suggesting that the upper bound 
produced by the parametric bootstrap is far too conservative. Conversely, 
P{H < 0.288 1(7 = 17.25) = 0.354, suggesting that the lower bound produced 
by the parametric bootstrap is too large to be reliable. Thus, the parametric 
bootstrap is both unreliable and inaccurate. Using the monotonicity of ho- 
mozygosity with Qi = 02 = 0.025 produces an exact 95% confidence interval 
of (—8, 105) for the Lyme disease data. The upper bound from this method 
performs much better than the bootstrap, as expected. The length of the 
exact confidence interval is over six times smaller than the length of the 
interval produced by the parametric bootstrap. 

A 95% credible interval from the posterior simulation gives (10.8, 124.9). 
The length of the interval produced by the parametric bootstrap is over six 
times larger than the interval estimate produced by the Bayesian method. 

4.2. KIR data. Our second example is from a data set published in Nor- 
man (2004) on KIR genes. The data are from the United Kingdom popula- 
tion [see locus DLl/Sl from Table 2 in Norman (2004)]. The KIR are highly 
polymorphic genes coding for proteins on natural killer cells and they de- 
tect a specific Major histocompatibility complex Class I protein found on 
diseased natural killer cells. Observed levels of variability at these loci sug- 
gest that the heterozygote advantage mechanism is a good candidate to 
explain the variation in KIR genes. The population frequencies are given by 
x' = (0.22,0.21,0.17,0.16,0.15,0.04,0.03,0.02). The homozygosity statistic 
is /i = 0.172, again close to the minimum /imin = 0.125 for k = 8. An ap- 
proximate 95% bootstrap interval estimate for this data set is (21.1,396.4). 
Conditioning on 9, the interval estimate using the monotonicity argument 
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Fig. 3. The empirical distributions of homozygosity under = —10 (left) and au ~ 159 
(right) for KIR data (h — 0.172). The shaded areas correspond to 2.5th and 97.5th per- 
centiles. 
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Fig. 4. A sample from posterior distribution of a for the KIR data set using fixed 9 
(left) and the joint estimation (right). 95% credible interval limits are (6.3, 182.9) and 
(4.3, 205.5) (shades) respectively. 

with ai = Q2 = 0.025 is given as (—10,159) (Figure 3), which is less than 
half the length of the bootstrap interval. 

Fixing 6 at the posterior mode, the KIR data giving a 95% credibility 
interval for a is (6.3,182.9) (Figure 4). For comparison purposes a 95% 
credibility interval for a, (4.3,205.5), is obtained by joint estimation of 6 
and cr is also included in Figure 4. Not surprisingly, the variability in a 
increases when the uncertainty in 6 is taken into account. 

5. Discussion. Wright-Fisher /c-allele models with selection provide a 
flexible framework for considering a wide array of biologically meaningful 
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selective schemes. The impact of the instabihty on the MLEs can vary sub- 
stantially depending on the particular selective scheme. In this section we 
describe an example using the symmetric homozygote advantage model. Our 
goal is to compare the homozygote and heterozygote advantage models in 
terms of the instability explained. The comparison is particularly insightful, 
as the two schemes have very different biological implications and yet there 
is a close connection between their parameterization. 

In contrast to heterozygote advantage, homozygote advantage selects for 
genotypes with the same allelic types. The symmetric version can be ob- 
tained by letting the selection matrix I] = —alk, cr > 0, now denoting the 
relative selective advantage of homozygotes to heterozygotes. Note that the 
difference between this model and the heterozygote advantage consists only 
of switching the sign of the selection parameter. Therefore, both heterozy- 
gote and homozygote advantages can be accommodated in the same model 
by allowing a G M, a useful property to compare the instability regions aris- 
ing under two regimes. Simulated sampling distributions obtained under the 
homozygote advantage display a heavy left tail, reflecting the sign change 
in fj. In light of Theorem 1, which holds for all selective schemes, this result 
is not surprising. The existence of a singularity point for the homozygote 
advantage model is guaranteed. The strongest signal for selection under this 
model is at H = 1 (i.e., at maximum homozygosity) and, in fact, the same 
point turns out to be where a is infinite since this point maximizes the mean 
fitness. 

Interestingly, the effect of the instability on the variability of MLEs un- 
der the homozygote advantage is milder than the heterozygote advantage. 
The difference is explained by examining the populations contributing to 
the tail of interest in the corresponding bootstrap distributions (i.e., left 
and right tails for homozygote and heterozygote advantage models respec- 
tively). They have different probabilities of arising under the two cases. Let 
0<£:<1 — 1/A; and consider populations that are close to perfect homozy- 
gosity {1 — £<H<1), and those that are the same distance from the perfect 
heterozygosity {l/k<H<l/k + £). In the bootstrap procedure, whenever 
a generated data set falls in these regions, a large MLE for the selection 
intensity is obtained. However, the probability of drawing a population in 
the first set, -Psei(l — £ < H < a), is lower than the probability of draw- 
ing a population in the second set, Psci(l/^' < H < 1/k + £\9,a), for a given 
(absolute) selection intensity (Figure 5). In other words, for a given number 
of simulated samples, the expected number of samples to hit the singular- 
ity region is larger under symmetric heterozygote advantage than under the 
homozygote advantage. Hence, the type of selective regime is an important 
factor when evaluating the effect of the instability on the confidence inter- 
vals. Unfortunately, in the important case of the heterozygote advantage 
model, the effect is pronounced. 
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Fig. 5. The probability that a sample falls into the instability region for different se- 
lection intensities (as percent of generated samples with 1000 samples at each a) under 
homozygote advantage (dots) and heterozygote advantage (open circles). The samples are 
generated under k = 10 and the instability region is chosen using e = 0.09 units from perfect 
homozygosity (1, 0.91) and heterozygosity (0.1, 0.19) (see text). 



The ultimate goal for the use of methods discussed in this paper is to de- 
velop statistical methods that can be used to detect selection at multiple loci 
simultaneously under the k allele models. Multiple loci data provide more 
information than single locus data, therefore, inference is expected to be 
more precise. The Bayesian methods gain a definite advantage of flexibility 
as the number of genetic loci, and thus the dimensionality of the problem, 
increases. 

Modern population genetics using coalescent based methods to explain 
polymorphism data have been effectively used to understand genealogy and 
recombination [Fearnhead (2001), Nordborg (2000), Griffiths and Marjoram 
(1997), Padhukasahasram et al. (2008)] but have been less successful with se- 
lection. The computational burden for simulating and analyzing data under 
coalescent based methods with selection remains heavy. There is a renewed 
interest in diffusion approaches which provide an alternative framework to 
handle models with selection [Wakeley (2005)]. Furthermore, diffusion mod- 
els are cornerstones of population genetics theory. Their relatively long his- 
tory resulted in a variety of useful models to investigate selection other than 
the /c-allele setup. An important task is to establish statistical properties of 
estimators and investigate the usefulness of different statistical paradigms 
under these models. 

Finally, counterintuitive results presented in this paper point out that 
care should be exercised in method choice when making inference on selec- 
tion under the class of models we presented. As emphasized above, realistic 
applications of the methods involve inference from multiple loci possibly with 
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complex selective schemes. However, such setups are not ideal to investigate 
the statistical properties of the estimators. Because computationally inten- 
sive methods employed in analyzing them become less tractable, complex 
models tend to hide problems of the type discussed in this paper. Therefore, 
rigorous tests of the methods under the single locus case are essential to 
guarantee the legitimacy of inference on selection made by employing these 
methods. 



APPENDIX A: PROOF OF THEOREM 1 

We begin by fixing 6 and X), then finding a point in data space that 
produces the largest possible signal for selection. Data space is represented 
by the k dimensional simplex defined by = {(xi, X2, . . . , x^) \ J2j=i = 
1}. Since is a compact set, there exists at least one point x* = x*(S) G A^ 

where e"^^'' is optimized. It follows from equation (1) that x* is the point 
in data space that produces the strongest possible signal for selection. Note 
that x* is obtained by minimizing the quadratic function xSx = ^ij^iXj 
subject to the constraint X^j^i = 1- This implies 

(7) ^ cFijX*x* < ^ OijXiXj Vx G Afc. 

i,3 id 

We now turn to the alternative problem where we fix the data x and calculate 
the maximum likelihood estimate for Xl denoted by 5](x). Throughout we 
will assume that 6 is known. Since the likelihood [equation (1)] is a smooth 
function of the parameters, the standard calculus approach to optimization 
based on the derivative of the log likelihood is a valid method for finding 
the MLE. It follows from equation (1) that 

ln/sci(x|0, S) = 



(8) 

= EseliXiXjlYl) - XiXj. 

Therefore, to obtain the MLE for S, denoted by 5](x), we set equation 

(8) equal to zero for each pair of indices i,j and solve for S(x). Thus, for a 
given data set x, we have 

(9) Esci{XiXj\t{^))-XiXj = 0. 
Now multiply equation (9) by aij and sum to obtain 

(10) E'sci I cT'ijXiXj 5^(x) I - ^ (TijXiXj = 0. 
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Now, assume x* satisfying the inequality (7) are the observed data. Then 
it follows from equation (7) that the newly defined random variable 

(11) M* = ^ GijXiXj - ^ ai,x*x* > 0, 

with probability 1. Assume the likelihood given by equation (1) is bounded 
at the point x*. By continuity, the maximum likelihood estimate S(x*) 
exists. Then it follows from equation (10) that 

(12) i?sci(M*|5](x*)) = 0. 

Therefore, M* is a non-negative random variable with mean zero, implying 
M* = with probability 1. However, M* is a continuous function of the 
continuous random vector X and is therefore equal to zero with probability 
zero. So, assuming the likelihood is bounded at x* leads to a contradiction. 



APPENDIX B: PROOF OF COROLLARY 1 



Proof of part a. Because the likelihood given by equation (3) is a 
smooth function of a, standard calculus methods can be used to derive the 
MLE for a. Again assuming that 6 is known and differentiating the natural 
log of the likelihood given in (3), we get 



(13) ^ ln/sei(x|(T, 9) = -h + Es,i{H\a). 

Define g{cr) = Es(.i{H\a), then g is a decreasing function. To show that g 
is decreasing, we show that g'{(j) < for all a. It follows from equation (3) 
that 

and 

^ENeut(e-'^^)i^Ncut(^'e-^) + {E^,^,{He~'^^)f 



5'(^) 



(i?Neut(e"-^))2 

-Varsci(^r)<0. 



That is, as the selection intensity a grows, then homozygotes are increas- 
ingly disadvantaged. Thus, the mean homozygosity, Ea,f,\{H\a), will become 
smaller as the selection intensity, cj, grows. By equation (13), the maximum 
likelihood estimate for a will satisfy the equation 

H = g{a). 
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To establish equation (4), we need to show that for every sequence of ho- 
mozygosities converging to 1/fc from above, the corresponding MLE for the 
selection intensity converges to infinity. Consider a monotone decreasing se- 
quence of real numbers {o„} where a„ — > as n — oo. Define an to be the 
solution to the equation 

g{an) = l/k + an- 

Since is a decreasing function, it follows by monotonicity that cj„ must be 
an increasing sequence. Increasing sequences must either converge or diverge 
to infinity. Suppose for a moment that fT„ ^ u* < oo as n — > oo. Then by 
continuity of g, we have that g{cr*) = l/k. This implies that 

(14) ^sci(^-iAk*) = o. 

However, we know that H >l/k with probability one. Therefore, it follows 
from equation (14) that H — l/k is a nonnegative random variable with 
mean zero. This implies that H — l/k is identically zero with probability 
one. This is a contradiction, since we know that, for any value of a, H is a 
continuous random variable whose distribution can be derived from equation 
(3) and so cannot be equal to 1/A; with probability one. Therefore, an must 
go to infinity. □ 

Proof of part b. Note that g{a) = E{H\a) is a decreasing function 
in a that is bounded below by l/k. Therefore, linio-^oo fi'(<7) must con- 
verge. Denote the limit by b. Note that g~^{b) = oo. By part (a), g~^{b) 
is the maximum likelihood estimate for a when H = b. Since the maxi- 
mum likelihood estimate is finite for all h > /Ik, then b = l/k. Therefore, 
E{H — l/k\a) = E{\H — l/k\ \a) goes to zero as o" — > oo. This implies that 
the Li norm of H converges to l/k, which implies that H converges to 
1/A; in probability. The conclusion of part (b) is established by noting that 
H = l/k if and only if {Xi, X2, . . . , Xk) = {l/k,l/k, . . . ,1/k). □ 
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